Maximizing Multi-Information



Stochastic interdependence of a probability distribution on a product space is measured 
by its Kullback-Leibler distance from the exponential family of product distributions (called 
multi-information). Here we investigate low-dimensional exponential families that contain 
the maximizers of stochastic interdependence in their closure. 

Based on a detailed description of the structure of probability distributions with globally maximal multi-information we obtain our main result: The exponential family of pure pair-interactions contains all global maximizers of the multi-information in its closure.

1. Introduction

The starting point of this article is a geometric interpretation of the interdependence¹ of stochastic units. In order to illustrate the basic idea, we consider two units with the configuration sets $\Omega_1 = \Omega_2 = \{0, 1\}$. The configuration set of the whole system is just the Cartesian product $\Omega_1 \times \Omega_2 = \{(0,0), (1,0), (0,1), (1,1)\}$. The set of probability distributions (states) is a three-dimensional simplex $\overline{\mathcal{P}}(\Omega_1 \times \Omega_2)$ with the four extreme points $\delta_{(\omega_1,\omega_2)}$, $\omega_1, \omega_2 \in \{0,1\}$ (Dirac measures). The two units are independent with respect to $p \in \overline{\mathcal{P}}(\Omega_1 \times \Omega_2)$ iff

$$p(\omega_1,\omega_2) = p_1(\omega_1)\, p_2(\omega_2) \quad \text{for all } (\omega_1,\omega_2) \in \Omega_1 \times \Omega_2. \tag{1.1}$$

The set of factorizable distributions (1.1) is a two-dimensional manifold $\mathcal{F}$. Figure 1 shows the simplex $\overline{\mathcal{P}}(\Omega_1 \times \Omega_2)$ and its submanifold $\mathcal{F}$.



Given an arbitrary probability distribution $p$, we quantify the interdependence of the two units with respect to $p$ by its Kullback-Leibler distance from the set $\mathcal{F}$. In our two-unit case, this distance is nothing but the well-known mutual information, which was introduced by Shannon [Sh] as a fundamental quantity that provides a measure of the capacity of a communication channel.

Motivated by so-called Infomax principles within the field of neural networks [Li, TSE], one of us has investigated maximizers of the interdependence of stochastic units [Ay1, Ay2]. In our two-unit example, these are the distributions $\frac{1}{2}\left(\delta_{(0,0)} + \delta_{(1,1)}\right)$ and $\frac{1}{2}\left(\delta_{(1,0)} + \delta_{(0,1)}\right)$ (see Figure 1).

¹ Throughout the paper we use the term interdependence to indicate stochastic dependence among units, as opposed to dependence of general random variables.

Figure 1: The exponential family $\mathcal{F}$ in the simplex of probability distributions.

This article continues that work by analyzing the structure of maximizers of stochastic interdependence. In particular, this leads to some answers to the question of the existence and the structure of a natural low-dimensional manifold that contains all maximizers of the stochastic interdependence (see [Ay1], 3.4 (ii) and [Ay2], 4.2.3). We will prove that the exponential family of pure pair-interactions contains the global maximizers of multi-information in its closure. In our example of two binary units this exponential family is given by the convex hull of the two maximizers $\frac{1}{2}\left(\delta_{(0,0)} + \delta_{(1,1)}\right)$ and $\frac{1}{2}\left(\delta_{(1,0)} + \delta_{(0,1)}\right)$ shown in Figure 1.
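The two-unit example can be checked numerically. The following is a minimal sketch, assuming the joint distribution is stored as a nested list indexed by the two configurations; the function name `mutual_information` is a hypothetical helper, not notation from the article.

```python
import math

def mutual_information(p):
    """Mutual information (in nats) of a joint distribution given as a nested list p[w1][w2]."""
    p1 = [sum(row) for row in p]          # marginal of the first unit
    p2 = [sum(col) for col in zip(*p)]    # marginal of the second unit
    total = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            if pij > 0:                   # terms with p = 0 contribute nothing
                total += pij * math.log(pij / (p1[i] * p2[j]))
    return total

maximizer_a = [[0.5, 0.0], [0.0, 0.5]]    # (delta_(0,0) + delta_(1,1)) / 2
maximizer_b = [[0.0, 0.5], [0.5, 0.0]]    # (delta_(1,0) + delta_(0,1)) / 2
uniform = [[0.25, 0.25], [0.25, 0.25]]    # factorizable, so the interdependence vanishes
```

Both maximizers give the value $\ln 2$, while the uniform (factorizable) distribution gives zero.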

In physics, pair interactions are considered fundamental mechanisms that underlie most theories. Within the field of neural networks, the physical concept of pair-interactions is used to model the synaptic interactions of neurons.

2. Notation 

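Let $\Omega$ be a nonempty and finite set. In the corresponding real vector space $\mathbb{R}^{\Omega}$, we have the canonical basis $e_\omega$, $\omega \in \Omega$, which induces the natural scalar product $\langle \cdot, \cdot \rangle$. The set of probability distributions on $\Omega$ is denoted by $\overline{\mathcal{P}}(\Omega)$:

$$\overline{\mathcal{P}}(\Omega) := \Big\{ p = (p(\omega))_{\omega \in \Omega} \in \mathbb{R}^{\Omega} : p(\omega) \ge 0 \text{ for all } \omega, \text{ and } \sum_{\omega \in \Omega} p(\omega) = 1 \Big\}.$$

For a probability distribution $p$, we consider its support $\operatorname{supp} p := \{\omega \in \Omega : p(\omega) > 0\}$. The strictly positive distributions $\mathcal{P}(\Omega)$ have maximal support $\Omega$:

$$\mathcal{P}(\Omega) := \{ p \in \overline{\mathcal{P}}(\Omega) : \operatorname{supp} p = \Omega \}.$$

Note that $\overline{\mathcal{P}}(\Omega)$ is the closure of $\mathcal{P}(\Omega)$. For every vector $X = (X(\omega))_{\omega \in \Omega} \in \mathbb{R}^{\Omega}$, we consider the corresponding Gibbs measure:

$$\exp(X) \in \mathcal{P}(\Omega), \qquad \exp(X)(\omega) := \frac{e^{X(\omega)}}{\sum_{\omega' \in \Omega} e^{X(\omega')}}.$$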

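The image $\exp(\mathcal{T})$ of a linear (or more generally affine) subspace $\mathcal{T}$ of $\mathbb{R}^{\Omega}$ with respect to the map $X \mapsto \exp(X)$ is called exponential family (induced by $\mathcal{T}$). In this article, we are mainly interested in the "distance" of probability distributions from a given exponential family $\mathcal{E}$. More precisely, we use the Kullback-Leibler divergence or relative entropy $D : \overline{\mathcal{P}}(\Omega) \times \overline{\mathcal{P}}(\Omega) \to [0,\infty) \cup \{\infty\}$,

$$(p,q) \mapsto D(p \,\|\, q) := \begin{cases} \sum_{\omega \in \operatorname{supp} p} p(\omega) \ln \dfrac{p(\omega)}{q(\omega)}, & \text{if } \operatorname{supp} p \subseteq \operatorname{supp} q, \\ \infty, & \text{otherwise,} \end{cases}$$

to define the continuous² function $D_{\mathcal{E}} : \overline{\mathcal{P}}(\Omega) \to \mathbb{R}_+$,

$$p \mapsto D_{\mathcal{E}}(p) := \inf_{q \in \mathcal{E}} D(p \,\|\, q).$$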

For $k \in \mathbb{N}$ we denote the set $\{1, \dots, k\}$ by $[k]$.
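The case distinction in the definition of $D(p \,\|\, q)$ can be made concrete with a short sketch. It assumes distributions are stored as dictionaries mapping configurations to probabilities; the name `kl_divergence` is a hypothetical helper chosen for illustration.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in nats for distributions given as dicts omega -> probability."""
    total = 0.0
    for omega, p_w in p.items():
        if p_w == 0:
            continue                      # only the support of p contributes
        q_w = q.get(omega, 0.0)
        if q_w == 0:
            return math.inf               # supp p is not contained in supp q
        total += p_w * math.log(p_w / q_w)
    return total

dirac = {"a": 1.0}
fair = {"a": 0.5, "b": 0.5}
```

Here `kl_divergence(dirac, fair)` is finite, whereas `kl_divergence(fair, dirac)` is infinite because the support of `fair` is not contained in that of `dirac`.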



3. Sufficiency of Low-Dimensional Exponential Families for the Maximization of Multi-Information

We consider the set $V := [N] = \{1, \dots, N\}$ of $N \ge 2$ units, and corresponding sets $\Omega_i$, $i \in [N]$, of configurations. The number $|\Omega_i|$ of configurations of a unit $i$ is denoted by $n_i$. Without restriction of generality we assume

$$2 \le n_1 \le n_2 \le \dots \le n_N.$$

For a subsystem $A \subseteq [N]$, the set of configurations on $A$ is given by the product $\Omega_A := \times_{i \in A} \Omega_i$. One has the natural restriction

$$X_A : \Omega_V \to \Omega_A, \qquad (\omega_i)_{i \in [N]} \mapsto (\omega_i)_{i \in A},$$

which induces the projection

$$\overline{\mathcal{P}}(\Omega_V) \to \overline{\mathcal{P}}(\Omega_A), \qquad p \mapsto p_A,$$

where $p_A$ denotes the image measure of $p$ with respect to the variable $X_A$. For $i \in [N]$ we write $p_i$ instead of $p_{\{i\}}$.

A probability distribution $p \in \overline{\mathcal{P}}(\Omega_V)$ is called factorizable if it satisfies

$$p(\omega_1, \dots, \omega_N) = p_1(\omega_1) \cdot \ldots \cdot p_N(\omega_N) \quad \text{for all } (\omega_1, \dots, \omega_N) \in \Omega_V.$$

² See Lemma 4.2 of [Ay1] for a proof.

The set $\mathcal{F}$ of strictly positive and factorizable probability distributions on $\Omega_V$ is an exponential family in $\mathcal{P}(\Omega_V)$ with

$$\dim \mathcal{F} = \sum_{i=1}^{N} (n_i - 1).$$

Now let us consider the function $D_{\mathcal{F}}$, which measures the distance from $\mathcal{F}$. We have $D_{\mathcal{F}}(p) = 0$ if and only if $p \in \overline{\mathcal{P}}(\Omega_V)$ is factorizable. Thus, this distance function can be interpreted as a measure that quantifies the stochastic interdependence of the units in $[N]$. The following entropic representation of $D_{\mathcal{F}}$ is well known (see [Am]):

$$I_p(X_1, \dots, X_N) := D_{\mathcal{F}}(p) = \sum_{i=1}^{N} H_p(X_i) - H_p(X_1, \dots, X_N).$$

Here, the $H_p(X_i)$ denote the marginal entropies and $H_p(X_1, \dots, X_N)$ is the global entropy. This measure of stochastic interdependence of the units, which is called multi-information, is a generalization of the mutual information (see the example in the introduction).
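The entropic representation lends itself to direct computation. The sketch below assumes a distribution on $\Omega_V$ is stored as a dictionary from configuration tuples to probabilities; the helper names (`entropy`, `marginal`, `multi_information`) and the example numbers are illustrative assumptions.

```python
import itertools
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy (in nats) of a dict outcome -> probability."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def marginal(p, i):
    """Marginal distribution of unit i for a dict configuration tuple -> probability."""
    m = defaultdict(float)
    for omega, prob in p.items():
        m[omega[i]] += prob
    return m

def multi_information(p):
    """Sum of the marginal entropies minus the global entropy."""
    n_units = len(next(iter(p)))
    return sum(entropy(marginal(p, i)) for i in range(n_units)) - entropy(p)

# A factorizable distribution on three binary units (illustrative marginals).
marginals = [(0.3, 0.7), (0.5, 0.5), (0.2, 0.8)]
product_dist = {
    omega: math.prod(m[w] for m, w in zip(marginals, omega))
    for omega in itertools.product((0, 1), repeat=3)
}
# A fully coupled distribution on three binary units.
coupled = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
```

The factorizable example has multi-information zero up to rounding, and the coupled example gives $2 \ln 2$.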

This article deals with the problem of finding natural low-dimensional exponential families that contain the maximizers of the multi-information in their closure. To this end we first consider a result on maximizers of the distance from an arbitrary exponential family [Ay1], in the improved form obtained in [MA]:

Prop. 3 of [MA]. Let $\mathcal{E}$ be an exponential family in $\mathcal{P}(\Omega)$ with dimension $d$. Then there exists an exponential family $\mathcal{E}^*$, $\mathcal{E} \subseteq \mathcal{E}^*$, with dimension less than or equal to $3d + 2$ such that the topological closure of $\mathcal{E}^*$ contains all local maximizers of $D_{\mathcal{E}}$.

This theorem is quite general, and is based on the observation that maximizers of the information divergence $D_{\mathcal{E}}$ have a reduced cardinality of their support, which is controlled by the dimension $d$ of $\mathcal{E}$. The direct application of Prop. 3 of [MA] to the exponential family $\mathcal{F}$ leads to the following statements on the local maximizers of the multi-information $I(X_1, \dots, X_N) = D_{\mathcal{F}}$.

Corollary 3.1 There exists an exponential family $\mathcal{F}^*$ with

$$\dim \mathcal{F}^* \le 3 \sum_{i=1}^{N} (n_i - 1) + 2 \le 3N(n_N - 1) + 2$$

that contains all local maximizers of $I(X_1, \dots, X_N)$ in its topological closure. In particular, in the binary case $n_i = 2$ for all $i$, $\dim \mathcal{F}^* \le 3N + 2$.

In all such statements about exponential families over product spaces one should keep in mind that the dimension of the exponential family $\mathcal{P}(\Omega_V)$ itself is of exponential growth in the number $N = |V|$ of units. So any exponential subfamily which is of polynomial growth in $N$ is of large codimension.

Our main goal is now the following. Knowing about the existence of such low-dimensional exponential families $\mathcal{F}^*$, we want to analyze the relation between them and exponential families given by interaction structures between the $N$ units.

More precisely, this article deals with the problem whether one can find low-dimensional exponential families $\mathcal{F}^*$ like in Corollary 3.1 that are at the same time given by a low order of interaction. Before going into the details, we state an informal version of the main result of the paper (using terminology from statistical physics):

Informal Version of Theorem 5.1. If the cardinalities $n_1, \dots, n_N$ fulfill an inequality (see Theorem 4.4), the exponential family of pure pair-interactions (that is, pair-interactions without any external field) is sufficient for generating all global maximizers of the multi-information.

Let us have a closer look at this result for the binary case. In this case, the exponential family of pure pair-interactions has dimension $N - 1$, which is stronger than Corollary 3.1. More important, the pair interactions form an explicit low-dimensional exponential family that appears in many models in physics and biology (the units being called particles respectively neurons, the interactions fields resp. dendrites).

In Section 5, we will provide a rigorous formulation of our main result and prove it. This will be based on results concerning the structure of global maximizers of multi-information, which is discussed in the following Section 4.

4. The Structure of Global Maximizers of Multi-Information 

4.1. General Structure 

Obviously, the maximal value of $I(X_1, \dots, X_N)$ is bounded as

$$I_p(X_1, \dots, X_N) = \sum_{i=1}^{N} H_p(X_i) - H_p(X_1, \dots, X_N) \le \sum_{i=1}^{N} \ln(n_i).$$

In fact, it turns out that in contrast to the quantum setting (see Remark 4.2 below), this upper bound is never reached. The following lemma gives an upper bound that is sharp in many interesting as well as important cases.

Lemma 4.1 Let $p$ be a probability distribution on $\Omega_V = \Omega_1 \times \dots \times \Omega_N$. Then:

$$I_p(X_1, \dots, X_N) \le \sum_{i=1}^{N-1} \ln(n_i). \tag{4.1}$$



Remark 4.2 With an orthonormal basis $f_1, \dots, f_n$ of the Hilbert space $\mathbb{C}^n$ we consider the (entangled) unit vector

$$\psi := \frac{1}{\sqrt{n}} \sum_{k=1}^{n} \bigotimes_{i=1}^{N} f_k \;\in\; \bigotimes_{i=1}^{N} \mathbb{C}^n$$

and the density operator $\rho$ defined by the orthogonal projection onto the subspace spanned by $\psi$. In this setting, the mutual information is extended as

$$I(\rho) = \sum_{i=1}^{N} S(\rho_i) - S(\rho) = \operatorname{tr}(\rho \ln(\rho)) - \sum_{i=1}^{N} \operatorname{tr}(\rho_i \ln(\rho_i)),$$

where $S$ denotes von Neumann entropy, and the $\rho_i$ are the partial traces of $\rho$. As we see, this multi-information has the value $N \ln(n)$, which, according to Lemma 4.1, is not possible within the classical setting.



In the following, we consider the set

$$\mathcal{M}(\Omega_1, \dots, \Omega_N) := \Big\{ p \in \overline{\mathcal{P}}(\Omega_V) : I_p(X_1, \dots, X_N) = \sum_{i=1}^{N-1} \ln(n_i) \Big\}$$

of probability distributions that maximize, according to Lemma 4.1, in the case $\mathcal{M}(\Omega_1, \dots, \Omega_N) \ne \emptyset$ the multi-information $I(X_1, \dots, X_N)$. Up to isomorphism, everything depends only on the cardinalities $n_i = |\Omega_i|$ so that we sometimes write $\mathcal{M}(n_1, \dots, n_N)$ instead of $\mathcal{M}(\Omega_1, \dots, \Omega_N)$.

The next theorem characterizes the probability distributions in $\mathcal{M}(\Omega_1, \dots, \Omega_N)$.



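Theorem 4.3 Let $p$ be a probability distribution on $\Omega_V$. Then $p \in \mathcal{M}(\Omega_1, \dots, \Omega_N)$ if and only if there exist a probability distribution $p^{(N)} \in \overline{\mathcal{P}}(\Omega_N)$ and surjective maps $\pi_i : \Omega_N \to \Omega_i$, $i = 1, \dots, N-1$, with

$$p^{(N)}\{\pi_i = \omega_i\} = \frac{1}{n_i} \qquad (\omega_i \in \Omega_i) \tag{4.2}$$

such that for all $(\omega_1, \dots, \omega_N) \in \Omega_V$

$$p(\omega_1, \dots, \omega_N) = \begin{cases} p^{(N)}(\omega_N), & \text{if } \omega_i = \pi_i(\omega_N) \text{ for all } i \in [N-1], \\ 0, & \text{otherwise.} \end{cases} \tag{4.3}$$

Theorem 4.3 allows us to say precisely under which conditions on the unit sizes $n_i$ the theoretical maximum (4.1) of multi-information can be achieved (we use the shorthands $W := 2^{[N-1]} \setminus \{\emptyset\}$ and $n_A := (n_i)_{i \in A}$ and denote the greatest common divisor by GCD):

Theorem 4.4 We have $\mathcal{M}(\Omega_1, \dots, \Omega_N) \ne \emptyset$ if and only if $n_N \ge n_{\min}$ for

$$n_{\min} = n_{\min}(n_1, \dots, n_{N-1}) := \sum_{A \in W} (-1)^{|A|-1} \operatorname{GCD}(n_A).$$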
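The threshold $n_{\min}$ is easy to evaluate for concrete unit sizes. The sketch below computes the inclusion-exclusion sum and, as a cross-check, the cardinality of the union of the sets $T_m = \{i/m : i \in [m]\}$ that appears in Remarks 4.5. The function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from functools import reduce
from itertools import combinations
from math import gcd

def n_min(sizes):
    """Inclusion-exclusion sum over all nonempty subsets A of the sizes n_1, ..., n_{N-1}."""
    total = 0
    for r in range(1, len(sizes) + 1):
        for subset in combinations(sizes, r):
            total += (-1) ** (r - 1) * reduce(gcd, subset)
    return total

def n_min_by_counting(sizes):
    """Cardinality of the union of the sets T_m = {i/m : i = 1, ..., m} over m in sizes."""
    return len({Fraction(i, m) for m in sizes for i in range(1, m + 1)})
```

For the sizes $(4, 6)$ both functions return $4 + 6 - 2 = 8$, so a third unit needs at least eight configurations for the bound (4.1) to be attained.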

Remarks 4.5 1. In particular, $\mathcal{M}(\Omega_1, \dots, \Omega_N) \ne \emptyset$ if

(a) there are only $N = 2$ units, or

(b) all units are identical ($n_1 = \dots = n_N$).

In the following Sections 4.2 and 4.3 we discuss these two important examples of Theorem 4.4 more precisely.

2. (a) We have the following inequalities for $n_{\min}$:

$$\max(n_1, \dots, n_{N-1}) \le n_{\min} \le 1 + \sum_{i=1}^{N-1} (n_i - 1).$$

These follow immediately from the defining relation $n_{\min} = \big| \bigcup_{i \in [N-1]} T_{n_i} \big|$ for $T_m := \{ \tfrac{i}{m} : i \in [m] \}$, since $|T_{n_i}| = n_i$ and $1 \in T_{n_i}$.

The left inequality becomes an equality iff the least common multiple $\operatorname{LCM}(n_{[N-1]}) = n_{N-1}$ (still assuming that $n_{i+1} \ge n_i$), whereas the right inequality becomes an equality iff the integers $n_1, \dots, n_{N-1}$ are mutually prime.

(b) Additionally, one gets

$$n_{\min} \le \operatorname{LCM}(n_{[N-1]}) =: l,$$

since for all $i \in [N-1]$ the inclusion $T_{n_i} \subseteq T_l$ holds true. Again we have equality iff $\operatorname{LCM}(n_{[N-1]}) = n_{N-1}$.

(c) The global maximizers $p \in \mathcal{M}(\Omega_1, \dots, \Omega_N)$ of multi-information that we construct simultaneously maximize the mutual information of the pairs $\{i, N\}$ of units.

In the case $\operatorname{LCM}(n_{[N-1]}) = n_N$ they even simultaneously maximize the mutual information of all pairs $\{i, j\} \subseteq [N]$ of units.

Both statements follow from direct inspection of $p$ defined in (6.4).

4.2. The Case of Two Units 

We now discuss the case of two units, i.e. $N = 2$. In this case, the set

$$\mathcal{M}(\Omega_1, \Omega_2) = \{ p \in \overline{\mathcal{P}}(\Omega_1 \times \Omega_2) : I_p(X_1, X_2) = \ln(n_1) \}$$

is non-empty and therefore consists of all global maximizers of the mutual information of the two units. We want to describe the structure of $\mathcal{M}(\Omega_1, \Omega_2)$ by stratifying it into a disjoint union of relatively open sets. In order to do that, we consider for $\Omega_1^* := \Omega_1 \cup \{0\}$ the following set of maps

$$S := \{ \pi : \Omega_2 \to \Omega_1^* : \pi(\Omega_2) \supseteq \Omega_1 \}. \tag{4.4}$$

The relation

$$\sigma \preceq \pi \;:\Longleftrightarrow\; \sigma^{-1}(\omega_1) \subseteq \pi^{-1}(\omega_1) \text{ for all } \omega_1 \in \Omega_1$$

on $S$ is a partial order which makes $S$ a poset.

Figure 2: The posets for $\Omega_1 = \{1, 2\}$, $\Omega_2 = \{1, 2, 3\}$.



Example 4.6 For $\Omega_1 = \{1, 2\}$ and $\Omega_2 = \{1, 2, 3\}$ we get a poset $S$ of 12 maps. The right graphic in Figure 2 shows the cover graph of the poset with vertex set $S$. On the left we show the graphs of four of these maps. We have $\sigma \prec \pi$ if $\sigma$ is in the lower line and connected to $\pi$ in the upper line (so-called Hasse diagram).

We call a poset connected iff its cover graph is connected. 
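The poset of Example 4.6 is small enough to enumerate. The following sketch assumes $\Omega_1 = \{1, \dots, n_1\}$ and $\Omega_2 = \{1, \dots, n_2\}$, encodes a map $\pi$ as the tuple of its values, and tests connectivity on the comparability graph, which has the same connected components as the cover graph. All function names are hypothetical.

```python
from itertools import product

def poset_maps(n1, n2):
    """All maps pi: {1..n2} -> {0, 1..n1} whose image contains {1..n1}, as value tuples."""
    targets = set(range(1, n1 + 1))
    return [pi for pi in product(range(n1 + 1), repeat=n2) if targets <= set(pi)]

def precedes(sigma, pi):
    """sigma <= pi: every point sent by sigma to a nonzero value is sent by pi to the same value."""
    return all(s == 0 or s == t for s, t in zip(sigma, pi))

def is_connected(maps):
    """Depth-first search on the comparability graph of the poset."""
    if not maps:
        return True
    seen, stack = {maps[0]}, [maps[0]]
    while stack:
        current = stack.pop()
        for other in maps:
            if other not in seen and (precedes(current, other) or precedes(other, current)):
                seen.add(other)
                stack.append(other)
    return len(seen) == len(maps)
```

For $n_1 = 2$, $n_2 = 3$ this yields the 12 maps of Example 4.6 in a single connected component, while for $n_1 = n_2$ the bijections are pairwise incomparable, in line with Lemma 4.7.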



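Lemma 4.7 The poset (4.4) is connected if and only if $n_1 < n_2$.

Given $\pi \in S$ we consider the convex and relatively open set

$$\mathcal{M}_\pi(\Omega_1, \Omega_2) := \Big\{ p \in \overline{\mathcal{P}}(\Omega_1 \times \Omega_2) : \text{for all } \omega_1 \in \Omega_1, \; \sum_{\omega_2 \in \Omega_2} p(\omega_1, \omega_2) = \frac{1}{n_1} \text{ and } p(\omega_1, \omega_2) > 0 \text{ iff } \pi(\omega_2) = \omega_1 \Big\}.$$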

We denote by $S_{m,n}$ the Stirling numbers of the second kind (see for example [Ai]).

Theorem 4.8

(1) The set of global maximizers of the mutual information is a disjoint union

$$\mathcal{M}(\Omega_1, \Omega_2) = \biguplus_{\pi \in S} \mathcal{M}_\pi(\Omega_1, \Omega_2)$$

of sets $\mathcal{M}_\pi(\Omega_1, \Omega_2)$.

(2) These sets have dimension

$$\dim \mathcal{M}_\pi(\Omega_1, \Omega_2) = |\pi^{-1}(\Omega_1)| - |\Omega_1|,$$

and there are $n_1! \binom{n_2}{l} S_{l, n_1}$ sets $\mathcal{M}_\pi(\Omega_1, \Omega_2)$ of dimension $l - n_1$.

(3) The inclusion $\mathcal{M}_\sigma(\Omega_1, \Omega_2) \subseteq \overline{\mathcal{M}_\pi(\Omega_1, \Omega_2)}$ holds if and only if $\sigma \preceq \pi$, and the set $\mathcal{M}(\Omega_1, \Omega_2)$ is connected if and only if $n_1 < n_2$.

Figure 3: The structure of $\mathcal{M}(2, 3)$.



Example 4.9 Continuing Example 4.6 for $n_1 = 2$ and $n_2 = 3$ the set $\mathcal{M}(2, 3)$ is the disjoint union of six points and six open intervals (see Figure 3, left), combined in the form of a hexagon (see Figure 3, right). So $\mathcal{M}(2, 3)$ is homeomorphic to $S^1$ in this case.
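The counts in Example 4.9 can be reproduced from the formula in Theorem 4.8 (2). The sketch below is an illustration under the assumption that the number of strata of dimension $l - n_1$ is $n_1! \binom{n_2}{l} S_{l,n_1}$ as stated there; `stirling2` and `strata_by_dimension` are hypothetical helper names.

```python
from math import comb, factorial

def stirling2(m, k):
    """Stirling number of the second kind via the standard recurrence."""
    if m == k:
        return 1
    if k == 0 or k > m:
        return 0
    return k * stirling2(m - 1, k) + stirling2(m - 1, k - 1)

def strata_by_dimension(n1, n2):
    """Map dimension l - n1 to the number n1! * C(n2, l) * S(l, n1) of strata of that dimension."""
    return {
        l - n1: factorial(n1) * comb(n2, l) * stirling2(l, n1)
        for l in range(n1, n2 + 1)
    }
```

For $n_1 = 2$, $n_2 = 3$ the result is six strata of dimension 0 and six of dimension 1, matching the six points and six open intervals of the hexagon.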



4.3. The Case of N Equal Units

This section deals with the important example of $N$ units with $n_1 = \dots = n_N =: n$. In that situation, Theorem 4.3 has the following direct implication.

Corollary 4.10 The set $\mathcal{M}(\Omega_1, \dots, \Omega_N)$ consists of all probability distributions

$$\frac{1}{n} \sum_{\omega_N \in \Omega_N} \delta_{(\pi_1(\omega_N), \dots, \pi_{N-1}(\omega_N), \omega_N)},$$

where $\pi_i : \Omega_N \to \Omega_i$, $i = 1, \dots, N-1$, are one-to-one mappings. This implies

$$|\mathcal{M}(\Omega_1, \dots, \Omega_N)| = (n!)^{N-1}, \tag{4.5}$$

and for all $p \in \mathcal{M}(\Omega_1, \dots, \Omega_N)$:

$$I_p(X_1, \dots, X_N) = (N-1) \cdot \ln(n), \qquad |\operatorname{supp} p| = n. \tag{4.6}$$
Thus according to (4.5), the number of the maximizers of the multi-information grows exponentially in $N$. In particular, for binary units the set $\mathcal{M}(\Omega_1, \dots, \Omega_N)$ has $2^{N-1}$ elements. In view of this fact, it is interesting that according to Corollary 3.1 there is an exponential family of dimension $\le 3N + 2$ that approximates all these global maximizers of the multi-information. This bound can even be improved. Although it is not our main goal to do that, we close this subsection by an interesting $N$-independent upper bound, which implies that for $N$ binary units there exists an exponential family with dimension less than or equal to 5 that approximates all $2^{N-1}$ elements of $\mathcal{M}(\Omega_1, \dots, \Omega_N)$.

Nihat Ay, Andreas Knauf



Theorem 4.11 There exists an exponential family with dimension less than or equal to $(n^2 + 3n)/2$ that contains $\mathcal{M}(\Omega_1, \dots, \Omega_N)$ in its closure.

This exponential family, however, is based on multibody interactions (in terms of statistical mechanics) between the units $i \in [N]$.



5. Sufficiency of Low-Order Interaction for the Maximization of Multi-Information

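Given a subset $A \subseteq [N] = \{1, \dots, N\}$, we decompose $\omega \in \Omega_V$ in the form $\omega = (\omega_A, \omega_{[N] \setminus A})$ with $\omega_A \in \Omega_A$, $\omega_{[N] \setminus A} \in \Omega_{[N] \setminus A}$. We define $\mathcal{I}_A$ to be the subspace of functions that do not depend on the configurations $\omega_{[N] \setminus A}$:

$$\mathcal{I}_A := \Big\{ f \in \mathbb{R}^{\Omega_V} : f(\omega_A, \omega_{[N] \setminus A}) = f(\omega_A, \omega'_{[N] \setminus A}) \text{ for all } \omega_A \in \Omega_A \text{ and all } \omega_{[N] \setminus A}, \omega'_{[N] \setminus A} \in \Omega_{[N] \setminus A} \Big\}.$$

The orthogonal projection $\Pi_A$ onto this $|\Omega_A|$-dimensional space with respect to the canonical scalar product

$$\langle f, g \rangle := \sum_{\omega \in \Omega_V} f(\omega)\, g(\omega) \qquad (f, g \in \mathbb{R}^{\Omega_V})$$

in $\mathbb{R}^{\Omega_V}$ is given by

$$\Pi_A(f)(\omega_A, \omega_{[N] \setminus A}) := \frac{1}{|\Omega_{[N] \setminus A}|} \sum_{\omega'_{[N] \setminus A}} f(\omega_A, \omega'_{[N] \setminus A}).$$

In order to describe only the pure contributions of $A$ to a function $f$, we "subtract" the contributions from subsets $B \subsetneq A$. This leads to the $\prod_{i \in A} (|\Omega_i| - 1)$-dimensional subspace

$$\tilde{\mathcal{I}}_A := \mathcal{I}_A \cap \Big( \bigcap_{B \subsetneq A} \mathcal{I}_B^{\perp} \Big)$$

and the orthogonal decomposition $\mathbb{R}^{\Omega_V} = \bigoplus_{A \subseteq [N]} \tilde{\mathcal{I}}_A$. Denoting the orthogonal projections onto $\tilde{\mathcal{I}}_A$ by $\tilde{\Pi}_A$ we thus have $\tilde{\Pi}_A \tilde{\Pi}_B = \delta_{A,B} \tilde{\Pi}_A$ and

$$\Pi_A = \sum_{B \subseteq A} \tilde{\Pi}_B, \qquad A \subseteq [N], \tag{5.1}$$

and every vector $f$ has a unique representation as a sum of orthogonal vectors:

$$f = \sum_{A \subseteq [N]} \tilde{\Pi}_A(f).$$

The vector $\tilde{\Pi}_A(f)$ is called the (pure) interaction among the units in $A$. With the Möbius inversion, (5.1) implies

$$\tilde{\Pi}_A(f) = \sum_{B \subseteq A} (-1)^{|A \setminus B|}\, \Pi_B(f).$$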



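Now we construct exponential families associated with such interaction spaces. The most general construction is based on a set of subsets of $[N]$. Given such a set $\mathcal{A} \subseteq 2^{[N]}$, we define the corresponding interaction space by

$$\mathcal{I}_{\mathcal{A}} := \bigoplus_{A \in \mathcal{A}} \tilde{\mathcal{I}}_A, \tag{5.2}$$

which generates the exponential family $\exp(\mathcal{I}_{\mathcal{A}})$. We want to apply this definition to the more specific situation of interactions with fixed order $k$. Therefore, we define

$$\mathcal{I}^{(k)} := \mathcal{I}_{\{A \subseteq [N] \,:\, |A| \le k\}}, \quad \text{and} \quad \tilde{\mathcal{I}}^{(k)} := \mathcal{I}_{\{A \subseteq [N] \,:\, |A| = k\}}.$$

We get the flag of vector spaces

$$\mathbb{R} = \mathcal{I}^{(0)} \subseteq \mathcal{I}^{(1)} \subseteq \mathcal{I}^{(2)} \subseteq \dots \subseteq \mathcal{I}^{(N)} = \mathbb{R}^{\Omega_V},$$

and the corresponding hierarchy of exponential families

$$\exp(\mathcal{I}^{(0)}) \subseteq \exp(\mathcal{I}^{(1)}) \subseteq \exp(\mathcal{I}^{(2)}) \subseteq \dots \subseteq \exp(\mathcal{I}^{(N)}) = \mathcal{P}(\Omega_V).$$

Here, $\exp(\mathcal{I}^{(0)})$ contains exactly one element, namely the center of the simplex.

The exponential family $\exp(\mathcal{I}^{(1)})$ is nothing but the exponential family $\mathcal{F}$ of factorizable distributions. Thus, the multi-information vanishes exactly on the topological closure of $\exp(\mathcal{I}^{(1)})$.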
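The dimension count behind the orthogonal decomposition can be verified for small systems. The sketch below assumes the pure interaction space of a subset $A$ has dimension $\prod_{i \in A} (n_i - 1)$, as stated above, and sums these dimensions by interaction order; the function names are hypothetical.

```python
from itertools import combinations
from math import prod

def pure_interaction_dim(sizes, subset):
    """Dimension prod_{i in A} (n_i - 1) of the pure interaction space of the subset A."""
    return prod(sizes[i] - 1 for i in subset)

def order_dims(sizes):
    """Entry k is the total dimension of the pure interaction spaces of order exactly k."""
    n_units = len(sizes)
    return [
        sum(pure_interaction_dim(sizes, a) for a in combinations(range(n_units), k))
        for k in range(n_units + 1)
    ]
```

For three binary units `order_dims((2, 2, 2))` is `[1, 3, 3, 1]`, which sums to $|\Omega_V| = 8$; the entry for order 1 equals $\dim \mathcal{F} = \sum_i (n_i - 1)$.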

Now we determine for a nonempty set $\mathcal{M}(\Omega_1, \dots, \Omega_N)$ of maximizers the lowest order $k$ such that $\mathcal{M}(\Omega_1, \dots, \Omega_N)$ is contained in the topological closure of $\exp(\tilde{\mathcal{I}}^{(k)})$. The first possible candidate for this is given by $k = 2$. The following theorem states that this is also sufficient.

Theorem 5.1 There exists an exponential family $\mathcal{F}^* \subseteq \exp(\tilde{\mathcal{I}}^{(2)})$ of dimension $\dim(\mathcal{F}^*) = (n_N - 1) \sum_{i=1}^{N-1} (n_i - 1)$ containing in its closure all global maximizers of the multi-information ($\mathcal{M}(\Omega_1, \dots, \Omega_N) \subseteq \overline{\mathcal{F}^*}$).

This theorem represents our main result which we already stated informally in Section 3. Note that compared with Theorem 4.11, for large $N$ Theorem 5.1 leads to an exponential family $\mathcal{F}^*$ of higher dimension. On the other hand, we still have an exponential (in $N$) codimension in the simplex $\mathcal{P}(\Omega_V)$.

In addition to that, the exponential family of Theorem 5.1 represents a concrete model that appears in many applications in physics and biology. For instance, within the field of neural networks, the exponential family $\exp(\mathcal{I}^{(2)})$, which contains $\exp(\tilde{\mathcal{I}}^{(2)})$ as a subfamily, is known as the family of Boltzmann machines [AHS, AK, AKN]. Applied to this context, our result states that Boltzmann machines are able to generate all distributions that have globally maximal multi-information, and that their dimensionality $\binom{N}{2}$ is not minimal for $N > 2$.

Examples 5.2

(1) The Case of Two Units. In this case, the hierarchy of interactions ends with $k = 2$, because we have just two units. Thus the simplex $\mathcal{P}(\Omega_1 \times \Omega_2)$ is equal to the exponential family $\exp(\mathcal{I}^{(2)})$, which has dimension $n_1 n_2 - 1$. The codimension of the subfamily $\exp(\tilde{\mathcal{I}}^{(2)})$ of Theorem 5.1 then is $n_1 + n_2 - 2$. Applied to our example of two binary units from the introduction, we see that

$$\dim(\exp(\tilde{\mathcal{I}}^{(2)})) = 1.$$

In Figure 1, we obtain this family by simply taking the convex combinations of the two maximizers:

$$\exp(\tilde{\mathcal{I}}^{(2)}) = \Big\{ \frac{\lambda}{2} \left( \delta_{(0,0)} + \delta_{(1,1)} \right) + \frac{1 - \lambda}{2} \left( \delta_{(1,0)} + \delta_{(0,1)} \right) : 0 < \lambda < 1 \Big\}.$$

(2) The Case of N Equal Units. According to Theorem 4.3, for $|\Omega_i| = n$ we have $|\mathcal{M}(\Omega_1, \dots, \Omega_N)| = (n!)^{N-1}$ maximizers, which are, according to Theorem 5.1, contained in the closure of an exponential family $\mathcal{F}^*$ of pure pair interactions, with

$$\dim(\mathcal{F}^*) = (N - 1)(n - 1)^2.$$
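The dimension statements of this section and of Section 3 can be compared numerically. The sketch below takes the formulas as read from Theorem 5.1, Corollary 3.1 and Theorem 4.11; the function names are hypothetical and only serve the comparison.

```python
from math import prod

def dim_pure_pair_family(sizes):
    """(n_N - 1) * sum_{i < N} (n_i - 1), with the sizes sorted so that n_N is the largest."""
    *rest, n_last = sorted(sizes)
    return (n_last - 1) * sum(n - 1 for n in rest)

def dim_simplex(sizes):
    """Dimension of the full simplex over the product configuration space."""
    return prod(sizes) - 1

def bound_corollary_3_1(sizes):
    """Upper bound 3 * sum_i (n_i - 1) + 2 of Corollary 3.1."""
    return 3 * sum(n - 1 for n in sizes) + 2

def bound_theorem_4_11(n):
    """Upper bound (n^2 + 3n) / 2 of Theorem 4.11 for equal units with n states."""
    return (n * n + 3 * n) // 2
```

For five binary units the pure pair family has dimension $N - 1 = 4$, below the bound $3N + 2 = 17$ of Corollary 3.1, while the simplex has dimension 31; the bound of Theorem 4.11 evaluates to 5 for $n = 2$.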

6. Proofs 

We fix the following notations: For $V' \subseteq [N]$, $H_{V'}$ denotes the entropy of the random variable $X_{V'}$. Obviously $H_V = H$, and $H_{\{i\}} = H_i$. For two subsets $V', V'' \subseteq [N]$, $H_{(V'' | V')}$ is the conditional entropy of $X_{V''}$ given $X_{V'}$. For $V' = \{a_1, \dots, a_L\}$ and $V'' = \{b_1, \dots, b_M\}$ we also write $H_{(b_1, \dots, b_M | a_1, \dots, a_L)}$ instead of $H_{(V'' | V')} = H_{(\{b_1, \dots, b_M\} | \{a_1, \dots, a_L\})}$. Now let $V_1, \dots, V_r$ be a set of disjoint subsets of $[N] = \{1, \dots, N\}$. The multi-information of these subsystems is given by $I_{\{V_1, \dots, V_r\}} = \sum_{j=1}^{r} H_{V_j} - H_{V_1 \cup \dots \cup V_r}$. In the case where the subsets of $[N]$ have cardinality one, we also write $I_{\{i_1, \dots, i_r\}}$ instead of $I_{\{\{i_1\}, \dots, \{i_r\}\}}$. We obviously have $I_V = I$.

Proof of Lemma 4.1

By the chain rule $H(X, Y) = H(X) + H(Y \mid X)$,

$$\begin{aligned}
I_p(X_1, \dots, X_N) &= \sum_{i=1}^{N} H_p(X_i) - H_p(X_1, \dots, X_N) \\
&= \sum_{i=1}^{N-1} H_p(X_i) - \big( H_p(X_1, \dots, X_N) - H_p(X_N) \big) \\
&= \sum_{i=1}^{N-1} H_p(X_i) - H_p(X_1, \dots, X_{N-1} \mid X_N) \\
&\le \sum_{i=1}^{N-1} H_p(X_i) \le \sum_{i=1}^{N-1} \ln(n_i),
\end{aligned}$$

proving the lemma. □

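Proof of Theorem 4.3

If a probability distribution $p$ on $\Omega_V$ has the form (4.3) with a distribution $p^{(N)} \in \overline{\mathcal{P}}(\Omega_N)$ and surjective maps $\pi_i : \Omega_N \to \Omega_i$ that satisfy (4.2), then $I(p) = \sum_{i=1}^{N-1} \ln(n_i)$:

$$\begin{aligned}
I(p) &= \sum_{i=1}^{N} H_i(p) - H(p) \\
&= \sum_{i=1}^{N} H_i(p) - H_N(p) - \underbrace{H_{(1|N)}(p) - H_{(2|1,N)}(p) - \dots - H_{(N-1|1,2,\dots,N-2,N)}(p)}_{=0} \\
&= \sum_{i=1}^{N-1} H_i(p) = \sum_{i=1}^{N-1} \ln(n_i).
\end{aligned}$$

Now we prove the opposite implication. Therefore we assume $I(p) = \sum_{i=1}^{N-1} \ln(n_i)$. This gives us

$$H_i(p) = \ln(n_i) \qquad (i = 1, \dots, N-1). \tag{6.1}$$

Otherwise the existence of an $i_0 \in \{1, \dots, N-1\}$ with $H_{i_0}(p) < \ln(n_{i_0})$ would imply the following contradiction:

$$\begin{aligned}
I(p) &= \sum_{i=1}^{N} H_i(p) - H(p) \\
&= \sum_{i=1}^{N-1} H_i(p) + H_N(p) - \big( H_N(p) + H_{(1,\dots,N-1|N)}(p) \big) \\
&\le \sum_{i=1}^{N-1} H_i(p) < \sum_{i=1}^{N-1} \ln(n_i).
\end{aligned}$$

From (6.1) we have

$$H(p) = \sum_{i=1}^{N} H_i(p) - I(p) = \Big( \sum_{i=1}^{N-1} \ln(n_i) + H_N(p) \Big) - \sum_{i=1}^{N-1} \ln(n_i) = H_N(p). \tag{6.2}$$

Now we set $p^{(N)} := p_N$, and define a Markov kernel $K : (\Omega_1 \times \dots \times \Omega_{N-1}) \times \Omega_N \to [0, 1]$ by

$$K(\omega_1, \dots, \omega_{N-1} \mid \omega_N) := \begin{cases} \dfrac{p(\omega_1, \dots, \omega_N)}{p_N(\omega_N)}, & \text{if } p_N(\omega_N) > 0, \\[2mm] \dfrac{1}{n_1 \cdots n_{N-1}}, & \text{if } p_N(\omega_N) = 0. \end{cases}$$

In these definitions we get

$$\begin{aligned}
H(p) - H_N(p) &= - \sum_{p_N(\omega_N) > 0} \; \sum_{\Omega_1 \times \dots \times \Omega_{N-1}} p_N(\omega_N)\, K(\omega_1, \dots, \omega_{N-1} \mid \omega_N) \ln \big( p_N(\omega_N)\, K(\omega_1, \dots, \omega_{N-1} \mid \omega_N) \big) - H_N(p) \\
&= \sum_{p_N(\omega_N) > 0} p_N(\omega_N) \cdot H(K(\cdot \mid \omega_N)) \ge 0.
\end{aligned}$$

From (6.2) this implies $H(K(\cdot \mid \omega_N)) = 0$ for all $\omega_N$ with $p_N(\omega_N) > 0$. This implies the existence of maps $\pi_i : \Omega_N \to \Omega_i$ with

$$p(\omega_1, \dots, \omega_N) = p^{(N)}(\omega_N) \prod_{i=1}^{N-1} \delta_{\pi_i(\omega_N)}(\omega_i).$$

Because of $H_i(p) = \ln(n_i)$ for all $i \in \{1, \dots, N-1\}$, these maps must be surjective. □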



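Proof of Theorem 4.4

Proof that $\mathcal{M}(\Omega_1, \dots, \Omega_N) \ne \emptyset$ if $n_N \ge n_{\min}$:

For $m \in \mathbb{N}$ set $T_m := \{ \tfrac{i}{m} : i \in [m] \}$. We claim that the cardinality of

$$T_{\Omega_V} := \bigcup_{i \in [N-1]} T_{n_i}$$

is given by $|T_{\Omega_V}| = n_{\min}$. This follows by the inclusion-exclusion principle if

$$\Big| \bigcap_{i \in A} T_{n_i} \Big| = \operatorname{GCD}(n_A) \qquad (A \in W), \tag{6.3}$$

since

$$\Big| \bigcup_{i \in [N-1]} T_{n_i} \Big| = \sum_{A \in W} (-1)^{|A|-1} \Big| \bigcap_{i \in A} T_{n_i} \Big|.$$

To prove (6.3), we set $m_A := \operatorname{GCD}(n_A)$ and note that $T_{n_i} \supseteq T_{m_A}$ ($i \in A$). Thus

$$\Big| \bigcap_{i \in A} T_{n_i} \Big| \ge |T_{m_A}| = m_A.$$

To show the converse inequality $\big| \bigcap_{i \in A} T_{n_i} \big| \le m_A$ we note that for some $\hat{m} \in \mathbb{N}$ we have $\bigcap_{i \in A} T_{n_i} = T_{\hat{m}}$. Thus for all $i \in A$ there exist $t_i \in [n_i]$ with $\tfrac{t_i}{n_i} = \tfrac{1}{\hat{m}} = \min(T_{\hat{m}})$, or $n_i = t_i \hat{m}$. Thus $\hat{m}$ divides all $n_i$ ($i \in A$) and, being the largest such integer, equals $m_A = \operatorname{GCD}(n_A)$.

Now we write $T_{\Omega_V}$ in the form $\{d_1, \dots, d_{n_{\min}}\}$ and set $d_0 := 0$, with ordering $d_i > d_{i-1}$ ($i \in [n_{\min}]$). Identifying $\Omega_i$ with $[n_i]$, the map

$$\Phi : T_{\Omega_V} \to \Omega_V, \qquad \Phi(d_j)_i := \begin{cases} \lceil d_j n_i \rceil, & i \in [N-1], \\ j, & i = N, \end{cases}$$

is well defined, since $\lceil d_j n_i \rceil \in [n_i]$ ($i \in [N-1]$), and by our assumption $n_N \ge n_{\min}$, which implies $j \in [n_N]$. The function

$$p : \Omega_V \to \mathbb{R}, \qquad p := \sum_{j=1}^{n_{\min}} (d_j - d_{j-1})\, \delta_{\Phi(d_j)} \tag{6.4}$$

is a probability distribution since $d_j - d_{j-1} > 0$ and

$$\sum_{j=1}^{n_{\min}} (d_j - d_{j-1}) = d_{n_{\min}} - d_0 = 1.$$

For all $i \in [N-1]$ and $l \in [n_i]$ the $i$th marginal probability equals

$$p_i(l) = \sum_{\omega \in X_i^{-1}(l)} p(\omega) = \sum_{j \,:\, \lceil d_j n_i \rceil = l} (d_j - d_{j-1}) = \frac{l}{n_i} - \frac{l-1}{n_i} = \frac{1}{n_i}.$$

We thus meet the condition of Theorem 4.3, showing that $p \in \mathcal{M}(\Omega_1, \dots, \Omega_N)$.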
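The construction in the first half of the proof can be carried out exactly with rational arithmetic. The sketch below is an illustration under the reading that the $j$-th point $d_j$ of the union of the sets $T_{n_i}$ is mapped to the configuration $(\lceil d_j n_1 \rceil, \dots, \lceil d_j n_{N-1} \rceil, j)$ and receives the weight $d_j - d_{j-1}$; the function names are hypothetical.

```python
from fractions import Fraction
from math import ceil

def construct_maximizer(sizes):
    """sizes = (n_1, ..., n_{N-1}); returns a dict configuration -> weight using n_min states for unit N."""
    points = sorted({Fraction(i, m) for m in sizes for i in range(1, m + 1)})
    p, previous = {}, Fraction(0)
    for j, d in enumerate(points, start=1):
        omega = tuple(ceil(d * m) for m in sizes) + (j,)   # image of d_j under the map Phi
        p[omega] = d - previous                            # weight d_j - d_{j-1}
        previous = d
    return p

def marginal(p, i):
    """Exact marginal distribution of unit i."""
    m = {}
    for omega, weight in p.items():
        m[omega[i]] = m.get(omega[i], Fraction(0)) + weight
    return m
```

For the sizes $(4, 6)$ the result is supported on eight configurations, and both marginals of the first two units are exactly uniform, as required by condition (4.2).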
Proof that $\mathcal{M}(\Omega_1, \dots, \Omega_N) = \emptyset$ if $n_N < n_{\min}$:

• The statement is trivial for $N = 2$ (remember that we assume $n_{i+1} \ge n_i$). Assume now that it is proven for all product spaces of at most $N \in \mathbb{N}$ units. Then for a probability distribution $p \in \mathcal{M}(\Omega_1, \dots, \Omega_{N+1})$ consider its marginal

$$\hat{p} \in \overline{\mathcal{P}}(\Omega_{[N]}).$$

We associate to $\hat{p}$ an $N$-partite graph $(V, E)$ whose vertex set is the disjoint union $V := \bigsqcup_{i=1}^{N} \Omega_i$. To every $\omega = (\omega_1, \dots, \omega_N) \in \operatorname{supp}(\hat{p}) \subseteq \Omega_{[N]}$ belongs the complete graph on the vertex set $\{\omega_1, \dots, \omega_N\} \subseteq V$, with edge set $G_\omega := \{ \{\omega_i, \omega_j\} \subseteq V \mid 1 \le i < j \le N \}$ on the $N$ vertices $\omega_1, \dots, \omega_N$. Then the edge set

$$E := \bigcup_{\omega \in \operatorname{supp}(\hat{p})} G_\omega$$

on $V$ is indeed $N$-partite. By the strict positivity (4.2) of the $p$-marginals no vertex $v \in V$ is isolated.

Figure 4: A bipartite graph $(V, G)$ for $N + 1 = 3$ units, with $|\Omega_1| = 4$, $|\Omega_2| = 6$, $|\Omega_3| \ge 8$ and a maximizer $p \in \mathcal{M}(\Omega_1, \Omega_2, \Omega_3)$. $(V, G)$ has the two components $C_1$, $C_2$.

• Every edge set $G_\omega \subseteq E$ is contained in the induced subgraph of exactly one connected component $C \subseteq V$ of the graph $(V, E)$. We attribute to $G_\omega$ the weight $\hat{p}(\omega)$ and to a connected component $C$ of the graph $(V, E)$ the sum $w(C)$ of the weights of the $G_\omega$ contained in it. These weights $w(C)$ of the connected components $C$ are not arbitrary numbers in $(0, 1]$. Instead, we know from Theorem 4.3 that the marginal distributions $p_i : \Omega_i \to [0, 1]$ of $\hat{p}$ (and thus of $p$, too) have the Laplace form

$$p_i(\omega_i) = \frac{1}{n_i} \qquad (i \in [N],\; \omega_i \in \Omega_i).$$

Therefore $w(C)$ is simultaneously an integer multiple of $1/n_i$ ($i \in [N]$) and thus an integer multiple of $1/\operatorname{GCD}(n_{[N]})$. Since the weights of the components sum to one, this implies the upper bound $\operatorname{GCD}(n_{[N]})$ for the number of connected components $C$ of the $N$-partite graph $(V, E)$.

• For the case of $N + 1 = 3$ units this already suffices to show the bound $n_3 \ge n_{\min} = n_1 + n_2 - \operatorname{GCD}(n_1, n_2)$. In this case the complete graphs are of cardinality $|G_\omega| = (N-1)! = 1$ so that $|E| = |\operatorname{supp}(\hat{p})|$. In general a graph on a vertex set of $v \in \mathbb{N}$ vertices with $e \in \mathbb{N}_0$ edges has at least $\max(v - e, 1)$ connected components. In the case at hand $v = n_1 + n_2$, and the number $c$ of connected components is at most $\operatorname{GCD}(n_1, n_2)$. So

$$n_3 \ge |\operatorname{supp}(p)| \ge |\operatorname{supp}(\hat{p})| = |E| = e \ge v - c \ge (n_1 + n_2) - \operatorname{GCD}(n_1, n_2) = n_{\min}.$$

• For arbitrary $N + 1 > 3$ this argument must be modified, since then $|G_\omega| = (N-1)! > 1$.

First of all we can substitute $G_\omega$ by any spanning tree $T_\omega \subseteq G_\omega$, and still the connected components $C$ of $(V, E')$ with $E' := \bigcup_{\omega \in \operatorname{supp}(\hat{p})} T_\omega$ coincide with the connected components $C$ of $(V, E)$. Each of these spanning trees has only $|T_\omega| = N - 1$ edges. However, in general $E'$, too, is not a disjoint union of the $T_\omega$.

We thus decompose the set $\operatorname{supp}(\hat{p})$ into a disjoint union

$$\operatorname{supp}(\hat{p}) = \bigsqcup_{k=1}^{N} A_k, \tag{6.5}$$

beginning with an arbitrarily chosen set $A_N$ of representatives $\omega \in C$ of the connected components $C \subseteq \Omega_{[N]}$. The estimate on the number of these components implies $|A_N| \ge \operatorname{GCD}(n_{[N]})$, and for $\omega \ne \omega' \in A_N$ the edge sets $G_\omega$ and $G_{\omega'}$ are disjoint.

Next we arrange the elements $\omega' \in C$ of the connected component $C$ containing $\omega \in A_N$ in the form of a spanning tree, with $G_{\omega'} \cap G_{\omega''} \ne \emptyset$ for $\{\omega', \omega''\}$ being an edge of that tree. For $\omega' = (\omega'_1, \dots, \omega'_N) \in C$ of distance $d(\omega')$ from $\omega \in A_N$ we put $\omega' \in A_k$ if there are exactly $k$ indices $i \in [N]$ with $\omega'_i$ not being equal to any $\omega''_i$ for $\omega'' = (\omega''_1, \dots, \omega''_N)$ with $d(\omega'') < d(\omega')$. This indeed gives a partition of the form (6.5).

Then by our induction hypothesis

$$|A_k| \ge \sum_{\substack{B \subseteq [N] \\ |B| \ge k}} (-1)^{|B|-k} \binom{|B|}{k} \operatorname{GCD}(n_B) \qquad (k = 1, \dots, N). \tag{6.6}$$

Namely, for $k = N$ (6.6) reduces to $|A_N| \ge \operatorname{GCD}(n_{[N]})$, which has been shown to be true. So if (6.6) would not hold, for the smallest $k < N$ violating (6.6), we would find a $B \subseteq [N]$ of cardinality $|B| = k < N$, whose marginal distribution $p_B$ has support of cardinality $\hat{n}_{k+1} := |\operatorname{supp}(p_B)| < n_{\min}(B) = \sum_{B' \subseteq B} (-1)^{|B'|-1} \operatorname{GCD}(n_{B'})$, see below.

But this would contradict our induction assumption, since then the system $\hat{\Omega} := \big( \times_{i \in B} [n_i] \big) \times [\hat{n}_{k+1}]$ would have an optimizing probability distribution, obtained from $p_B$ by means of some bijection $e : [\hat{n}_{k+1}] \to \operatorname{supp}(p_B)$, but yet not meet the criterion $\hat{n}_{k+1} \ge n_{\min}(B)$.

Summing the cardinalities (6.6), we obtain

$$\begin{aligned}
|\operatorname{supp}(\hat{p})| = \sum_{k=1}^{N} |A_k| &\ge \sum_{k=1}^{N} \sum_{\substack{B \subseteq [N] \\ |B| \ge k}} (-1)^{|B|-k} \binom{|B|}{k} \operatorname{GCD}(n_B) \\
&= \sum_{B \subseteq [N]} (-1)^{|B|} \operatorname{GCD}(n_B) \sum_{k=1}^{|B|} (-1)^{k} \binom{|B|}{k} \\
&= \sum_{B \subseteq [N]} (-1)^{|B|-1} \operatorname{GCD}(n_B) = n_{\min},
\end{aligned} \tag{6.7}$$

where $B$ ranges over the nonempty subsets of $[N]$. This is the induction step. □
Proof of Lemma 14.. 71 

If n\ = n-i then the maps 7r G 5 are isomorphisms 7r : f2 2 - * £~2i , so that a -< tt only 
for a = 7r. Thus in that case 5 is connected iff |<S| = 1, i.e. m = n% = 1. This 
contradicts our assumption n\,n2 > 2. 

If n 2 > ni and 1 7r — 1 (oji ) | > 1 for ir G S and some u)\ € fii, say 7t(cj 2 ) — w ii then 
a ~< 7r for 

7r(w 2 ), if ^2 7^ ^2 



a eS, er(w 2 ) : = 



, if U>2 = U>2 



So we need only show that any 7r',7r" £ 5 which are injective onto Qi are indeed 
connected. 

1. In the first step we move w' along the poset graph in order to decrease the 
cardinality of the symmetric difference (7r') _1 (0)A(7r") _1 (0). So we assume 
that there exist 

J G (7r')" 1 (0)\(7r")" 1 (0) and J 1 G ( 7 r")" 1 (0)\(7r')" 1 (0) 

and set 

, if uj = uj" 
ttgS, tt(uj):={ 7r'(u/'), if w = uj' 



7t' (w) , otherwise. 



Both it' and tt are covered by 



' ^ ' 1 7r '( w ) j otherwise, 

and 

|7r- 1 (0)A( 7 r")- 1 (0)| = |(7t')- 1 (0)A( 7 t")^ 1 (0)| - 2. 
By iterating the argument we can assume w.l.o.g. that (7r') _1 (0) = (7r") _1 (0). 
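2. In fact it is sufficient to treat the case where the permutation relating $\pi'$ and $\pi''$
is a transposition, as the transpositions generate the symmetric group. So
there exist $\omega' \neq \omega'' \in \Omega_2$ with

\[
\pi''(\omega) = \begin{cases} \pi'(\omega''), & \text{if } \omega = \omega' \\ \pi'(\omega'), & \text{if } \omega = \omega'' \\ \pi'(\omega), & \text{otherwise,} \end{cases}
\]

and we choose $\hat{\omega} \in \Omega_2$ so that $\pi'(\hat{\omega}) = \pi''(\hat{\omega}) = 0$.
Defining $\rho', \rho'' \in S$ by

\[
\rho'(\omega) := \begin{cases} \pi'(\omega''), & \text{if } \omega = \hat{\omega} \\ 0, & \text{if } \omega = \omega'' \\ \pi'(\omega), & \text{otherwise} \end{cases}
\quad \text{resp.} \quad
\rho''(\omega) := \begin{cases} \pi''(\omega'), & \text{if } \omega = \hat{\omega} \\ 0, & \text{if } \omega = \omega' \\ \pi''(\omega), & \text{otherwise,} \end{cases}
\]

$\pi'$ and $\rho'$ are covered by $\sigma' \in S$ and similarly $\pi''$ and $\rho''$ are covered by $\sigma'' \in S$
with

\[
\sigma'(\omega) := \begin{cases} \pi'(\omega''), & \text{if } \omega = \hat{\omega} \\ \pi'(\omega), & \text{otherwise} \end{cases}
\quad \text{resp.} \quad
\sigma''(\omega) := \begin{cases} \pi''(\omega'), & \text{if } \omega = \hat{\omega} \\ \pi''(\omega), & \text{otherwise.} \end{cases}
\]

Now as $\pi'(\omega'') = \pi''(\omega')$, both $\rho'$ and $\rho''$ are covered by

\[
\tau \in S, \qquad \tau(\omega) := \begin{cases} \pi'(\omega''), & \text{if } \omega = \hat{\omega} \\ \pi'(\omega'), & \text{if } \omega = \omega'' \\ \pi'(\omega), & \text{otherwise.} \end{cases}
\]

This shows that the poset graph is connected. □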
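
For small systems the connectivity statement can be checked by brute force. The sketch below is only an illustration and rests on two assumptions inferred from the proof rather than taken from a stated definition: $S$ is modelled as the set of maps $\pi : \Omega_2 \to \Omega_1 \cup \{0\}$ whose image contains $\Omega_1$, and $\sigma \preceq \pi$ is taken to mean that $\sigma$ agrees with $\pi$ wherever $\sigma$ is non-zero. The names `build_S`, `below` and `is_connected` are hypothetical.

```python
from itertools import product


def build_S(n1, n2):
    """All maps from Omega_2 to Omega_1 plus the extra value 0 whose image contains Omega_1.

    Omega_1 = {1, ..., n1}; a map is a tuple of length n2; 0 marks 'unassigned'.
    """
    targets = set(range(1, n1 + 1))
    return [pi for pi in product(range(n1 + 1), repeat=n2) if targets <= set(pi)]


def below(sigma, pi):
    """Assumed order: sigma agrees with pi wherever sigma is non-zero."""
    return all(s == 0 or s == p for s, p in zip(sigma, pi))


def is_connected(n1, n2):
    """Connectivity of the comparability graph of S, by depth-first search."""
    S = build_S(n1, n2)
    seen, stack = {S[0]}, [S[0]]
    while stack:
        cur = stack.pop()
        for other in S:
            if other not in seen and (below(cur, other) or below(other, cur)):
                seen.add(other)
                stack.append(other)
    return len(seen) == len(S)
```

Under these assumptions the search reports a connected graph for $n_2 > n_1$ (for example $n_1 = 2$, $n_2 = 3$) and a disconnected one for $n_1 = n_2 = 2$, where $S$ consists of the two bijections only.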



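Proof of Theorem 4.8.

To simplify notation, we set $\mathcal{M} := \mathcal{M}(\Omega_1, \Omega_2)$, and $\mathcal{M}_\pi := \mathcal{M}_\pi(\Omega_1, \Omega_2)$ for $\pi \in S$.

(1) We have $\mathcal{M}_\pi \subseteq \mathcal{M}$ since for the elements of $\mathcal{M}_\pi$ the characterisation of Theorem
4.3 holds true. Furthermore for $\sigma, \pi \in S$ with $\sigma \neq \pi$ there exists $(\omega_2, \omega_1) \in \operatorname{graph}(\pi)$
with $(\omega_2, \omega_1) \notin \operatorname{graph}(\sigma)$ or vice versa. Thus for $p \in \mathcal{M}_\pi$ we have $p(\omega_1, \omega_2) > 0$ but
for $p \in \mathcal{M}_\sigma$ we have $p(\omega_1, \omega_2) = 0$, showing that $\mathcal{M}_\pi \cap \mathcal{M}_\sigma = \emptyset$.

Finally for $p \in \mathcal{M}$ by Theorem 4.3 there exists a surjective map $\hat{\pi} : \Omega_2 \to \Omega_1$ with
$p(\omega_1, \omega_2) = 0$ whenever $\hat{\pi}(\omega_2) \neq \omega_1$. Given $\hat{\pi}$, we construct $\pi \in S$ by setting

\[
\pi(\omega_2) := \begin{cases} \hat{\pi}(\omega_2), & \text{if } p(\hat{\pi}(\omega_2), \omega_2) > 0 \\ 0, & \text{if } p(\hat{\pi}(\omega_2), \omega_2) = 0. \end{cases}
\]

As by Theorem 4.3 we have $\sum_{\omega_2 \in \hat{\pi}^{-1}(\omega_1)} p(\omega_1, \omega_2) = \frac{1}{n_1} > 0$, the function $\pi : \Omega_2 \to \Omega_1 \cup \{0\}$
so constructed has the property $\pi(\Omega_2) \supseteq \Omega_1$, making it an element of $S$.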

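(2) Given $\omega_1 \in \Omega_1$, the simplex of $|\pi^{-1}(\omega_1)|$ numbers $p(\omega_1, \omega_2) > 0$ with $\omega_2 \in \pi^{-1}(\omega_1)$
meeting $\sum_{\omega_2 \in \pi^{-1}(\omega_1)} p(\omega_1, \omega_2) = \frac{1}{n_1}$ has dimension $|\pi^{-1}(\omega_1)| - 1$, implying
the formula for $\dim \mathcal{M}_\pi$.

If $\dim \mathcal{M}_\pi = l - n_1$, the surjective map $\tilde{\pi} : \tilde{\Omega}_2 \to \Omega_1$ with $\tilde{\Omega}_2 := \pi^{-1}(\Omega_1) \subseteq \Omega_2$ and
$\tilde{\pi} := \pi|_{\tilde{\Omega}_2}$ is defined on a subset $\tilde{\Omega}_2 \subseteq \Omega_2$ of size $l$. There are precisely $\binom{n_2}{l}$ such
subsets, and there are precisely $n_1! \, S_{l, n_1}$ such surjective maps from $\tilde{\Omega}_2$ onto $\Omega_1$, see
Aigner [Jo], Chapter 3.1.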
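
The count $\binom{n_2}{l} \, n_1! \, S_{l,n_1}$ can be compared with a direct enumeration for small $n_1, n_2$. This is a minimal sketch, assuming that $S_{l,n_1}$ denotes the Stirling number of the second kind and that the elements of $S$ are maps $\Omega_2 \to \Omega_1 \cup \{0\}$ covering $\Omega_1$, with $l$ the number of arguments sent to a non-zero value. The helper names are hypothetical.

```python
from itertools import product
from math import comb, factorial


def stirling2(l, k):
    """Stirling number of the second kind S(l, k), by the standard recurrence."""
    if l == k:
        return 1
    if k == 0 or k > l:
        return 0
    return k * stirling2(l - 1, k) + stirling2(l - 1, k - 1)


def count_formula(n1, n2, l):
    """binom(n2, l) * n1! * S(l, n1), the number of pieces of dimension l - n1."""
    return comb(n2, l) * factorial(n1) * stirling2(l, n1)


def count_brute_force(n1, n2, l):
    """Count maps into {0, 1, ..., n1} that cover {1, ..., n1} with l non-zero values."""
    targets = set(range(1, n1 + 1))
    return sum(
        1
        for pi in product(range(n1 + 1), repeat=n2)
        if targets <= set(pi) and sum(v != 0 for v in pi) == l
    )
```

For $n_1 = 2$, $n_2 = 3$ both functions give 6 maps with $l = 2$ and 6 maps with $l = 3$, i.e. six zero-dimensional and six one-dimensional pieces under the stated assumptions.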

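(3) If $n_1 = n_2$ then $S$ coincides with the set of bijections $\pi : \Omega_2 \to \Omega_1$, and $|\mathcal{M}_\pi| = 1$.
Thus in this case $\mathcal{M}$ is not connected for $n_1 \ge 2$. If, however, $n_2 > n_1$, the poset $S$,
seen as a graph, is connected by Lemma 4.7.
The topological closure of $\mathcal{M}_\pi$ is given by

\[
\overline{\mathcal{M}_\pi} = \left\{ p \in \mathcal{P}(\Omega_1 \times \Omega_2) : \sum_{\omega_2 \in \pi^{-1}(\omega_1)} p(\omega_1, \omega_2) = \frac{1}{n_1}, \; p(\omega_1, \omega_2) = 0 \text{ if } \pi(\omega_2) \neq \omega_1 \right\}.
\]

Thus $\overline{\mathcal{M}_\pi} = \biguplus_{\sigma \preceq \pi} \mathcal{M}_\sigma$. □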

Proof of Corollary 4.10.

All statements directly follow from Theorem 4.3. □



Proof of Theorem 4.11.

We choose a map $\phi = (\phi_1, \dots, \phi_n) : \Omega_V \to \mathbb{R}^n$ such that the points $\phi(\omega)$, $\omega \in \Omega_V$,
are in general position; that is, each $k$ elements of $\phi(\Omega_V)$ with $k \le n + 1$ are affinely
independent. This property guarantees that for each set $\Sigma \subseteq \Omega_V$, $|\Sigma| = n$, there
exist real numbers $a_1, \dots, a_n, b$ such that

\[
\left\{ \omega \in \Omega_V : \sum_{i=1}^{n} a_i \phi_i(\omega) = b \right\} = \Sigma \qquad (6.8)
\]

holds. We consider the exponential family $\mathcal{E}^*$ that is generated by the constant function $1$ and
$\phi_1, \dots, \phi_n$, $\phi_i \phi_j$ $(1 \le i \le j \le n)$.

Counting the non-constant generators, we have

\[
\dim \mathcal{E}^* \le n + \frac{n(n+1)}{2}.
\]

Now let $p$ be an element of $\mathcal{M}(N \times n)$. From Theorem 4.10 we know that $|\operatorname{supp} p| = n$.
We prove that there exists a sequence in $\mathcal{E}^*$ that converges to $p$. We choose a
sequence $\beta_m \uparrow \infty$ and real numbers $a_1, \dots, a_n, b$ satisfying (6.8) with $\Sigma = \operatorname{supp} p$.
Then with

\[
E^{(m)} := -\beta_m \left( \sum_{i=1}^{n} a_i \phi_i - b \right)^2
\]

the sequence

\[
\frac{\exp E^{(m)}}{\sum_{\omega' \in \Omega_V} \exp E^{(m)}(\omega')}
\]

converges to $p$. □
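
The limiting argument can be illustrated numerically in the smallest case $n = 2$. The sketch below is an assumption-laden toy example, not the construction of the paper: the configurations are labelled by real numbers $t$, the map $\phi(t) = (t, t^2)$ is one convenient choice that puts the points in general position (no three on a line), and the energy is taken to be $-\beta \left( \sum_i a_i \phi_i - b \right)^2$. The function name `gibbs_on_parabola` is hypothetical.

```python
from math import exp


def gibbs_on_parabola(ts, sigma, beta):
    """Distribution proportional to exp(-beta * (a . phi(t) - b)^2) on the labels ts.

    phi(t) = (t, t^2) is an assumed embedding in general position in R^2.
    sigma = (s1, s2) are the two labels the limit should charge; (a, b) describes
    the line through phi(s1) and phi(s2).
    """
    s1, s2 = sigma
    (x1, y1), (x2, y2) = (s1, s1 * s1), (s2, s2 * s2)
    a = (y1 - y2, x2 - x1)        # normal vector of the line through both points
    b = a[0] * x1 + a[1] * y1     # offset, so that a . phi(s) = b for s in sigma
    weights = [exp(-beta * (a[0] * t + a[1] * t * t - b) ** 2) for t in ts]
    z = sum(weights)
    return [w / z for w in weights]
```

With `ts = [0, 1, 2, 3, 4]` and `sigma = (1, 3)`, increasing `beta` moves the mass onto the labels 1 and 3 in equal shares, while `beta = 0` gives the uniform distribution on all five labels.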
Proof of Theorem 5.1.

Using def. (…), we consider for $\mathcal{A} := \{\{1, N\}, \{2, N\}, \dots, \{N-1, N\}\} \subseteq 2^V$ the
linear subspace $\mathcal{I}_{\mathcal{A}}$
of pure pair interactions of the $N$th unit with all other units. The exponential family
$\mathcal{F}^* := \exp(\mathcal{I}_{\mathcal{A}}) \subseteq \mathcal{P}(\Omega_V)$ is of dimension

\[
\dim(\mathcal{F}^*) = (n_N - 1) \sum_{i=1}^{N-1} (n_i - 1),
\]

as asserted in Theorem 5.1.

Given a maximizer $p \in \mathcal{M}(\Omega_1, \dots, \Omega_N)$, we now construct a sequence of probability
distributions

\[
q^{(m)} := \exp(f^{(m)}) \in \mathcal{F}^* \qquad (m \in \mathbb{N})
\]

and show that $\lim_{m \to \infty} q^{(m)} = p$.

Here the functions in the exponent, which belong to $\mathcal{I}_{\mathcal{A}}$, are defined as the orthogonal projections onto $\mathcal{I}_{\mathcal{A}}$

of /(™) g x( 2 ) 

f{m)( , JJ X m + ln(pW(^) + l/m) 

For $\omega, \omega' \in \operatorname{supp}(p)$

g( m )(w) 



exp 



+ In (V^W) + -)}-\ m + ln (V^KO + - 
\ m J J \ to 



g( m )(w') 



in accordance with (4.3).

On the other hand, if $\omega' \in \operatorname{supp}(p)$ but $\omega \notin \operatorname{supp}(p)$, then there is an $i \in
\{1, \dots, N-1\}$ with $\omega_N \neq \pi_i^{-1}(\omega_i)$ or $p^{(N)}(\omega_N) = 0$. In both cases

\[
\lim_{m \to \infty} \frac{q^{(m)}(\omega)}{q^{(m)}(\omega')} = 0,
\]

again in accordance with (4.3). As the $q^{(m)}$ are probability distributions, we have
shown that $\lim_{m \to \infty} q^{(m)} = p$. □
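
The smallest instance, two binary units, gives a concrete picture of a pair-interaction family whose closure contains the global maximizers. The sketch below is an illustration under assumptions that go beyond the text: the one-dimensional family is parametrized as $q_\theta(\omega_1, \omega_2) \propto \exp\left(\theta \, (-1)^{\omega_1 + \omega_2}\right)$, and the names `pair_family`, `mutual_information` and `family_dimension` are hypothetical.

```python
from math import exp, log


def pair_family(theta):
    """q_theta(w1, w2) proportional to exp(theta * (-1)^(w1 + w2)) on {0,1}^2."""
    weights = {(w1, w2): exp(theta * (-1) ** (w1 + w2))
               for w1 in (0, 1) for w2 in (0, 1)}
    z = sum(weights.values())
    return {w: v / z for w, v in weights.items()}


def mutual_information(q):
    """Two-unit multi-information: KL distance of q from the product of its marginals."""
    m1 = {a: sum(v for (w1, _), v in q.items() if w1 == a) for a in (0, 1)}
    m2 = {b: sum(v for (_, w2), v in q.items() if w2 == b) for b in (0, 1)}
    return sum(v * log(v / (m1[w1] * m2[w2]))
               for (w1, w2), v in q.items() if v > 0)


def family_dimension(ns):
    """(n_N - 1) * sum over i < N of (n_i - 1), the dimension stated in the proof."""
    return (ns[-1] - 1) * sum(n - 1 for n in ns[:-1])
```

As $\theta \to +\infty$ the distribution approaches the uniform distribution on $\{(0,0), (1,1)\}$, and as $\theta \to -\infty$ the one on $\{(0,1), (1,0)\}$; in both limits the mutual information tends to $\ln 2$, the maximal value for two binary units, while `family_dimension((2, 2))` returns the dimension 1 of this family.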